Ensemble Methods

Plan

  1. Decision Trees

  2. Bagging (e.g. RandomForest)

  3. Boosting (e.g. GradientBoosting)

  4. Stacking

1. Decision Trees

Decision Trees are hierarchical supervised learning algorithms

They primarily help us with:

- Classification and Regression

- Non-linear & Non-parametric Modelling

- Feature Selection

How Do We “Grow” a Tree?

  • Trees work with a greedy divide-and-conquer strategy to identify ways to split a data set based on different conditions (feature, threshold).
  • Learn simple decision rules (if-else) inferred from the data features.
  • The algorithm selects the “best” condition based on a specific score.
  • The algorithm recursively splits the resulting subsets of data.
  • When no satisfying conditions are found, the node becomes a leaf.

DecisionTreeClassifier

Select the split that results in the most homogeneous sub-node

Gini Index

The Gini index measures how well a split separates the classes: it quantifies the impurity of each node as a value in [0, 1] - the lower, the better.

\[ Gini(\text{node}) = 1 - \sum_i p_i^2 \]

where \(p_i\) is the ratio of observations belonging to class \(i\) to the total number of observations remaining at the given node.
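
As a quick illustration (a minimal sketch, not scikit-learn's internal implementation), the Gini impurity of a node can be computed from the class labels it contains:

import numpy as np

def gini_impurity(labels):
    # class proportions p_i at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Gini(node) = 1 - sum(p_i^2)
    return 1 - np.sum(p ** 2)

gini_impurity([0, 0, 0, 0])  # 0.0 -> pure node
gini_impurity([0, 0, 1, 1])  # 0.5 -> maximally impure for two classes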

Predicting

A new point is passed through the tree from top to bottom until it reaches a leaf. The most represented class in that leaf is the predicted class for that data point.
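
A minimal scikit-learn sketch (the dataset and hyperparameters are purely illustrative):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree_clf = DecisionTreeClassifier(max_depth=3, criterion='gini')  # Gini-based splits
tree_clf.fit(X, y)

tree_clf.predict(X[:5])        # majority class of the leaf each point reaches
tree_clf.predict_proba(X[:5])  # class proportions inside that leaf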

Visual Intro

DecisionTreeRegressor
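
Same idea for regression, except that each leaf predicts the mean target value of the training points it contains; a minimal sketch with an illustrative synthetic dataset:

from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

tree_reg = DecisionTreeRegressor(max_depth=4)  # splits chosen to minimize the MSE
tree_reg.fit(X, y)
tree_reg.predict(X[:5])  # prediction = mean target value of the reached leaf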

Ensemble Methods

Combine several decision trees to produce better predictive performance than utilizing a single decision tree.

2. Bagging (i.e. Bootstrap Aggregating)

Bootstrap aggregating, also known as Bagging, is the aggregation of multiple versions of a model

  • It is a parallel ensemble method

  • The aim of bagging is to reduce variance

  • Each version of the model is called a weak learner

  • Weak learners are trained on bootstrapped samples of the dataset

Bootstrapping (or Generating Bootstrapped Samples)

  • The samples are created by randomly drawing data points, with replacement

  • Features can also be randomly filtered to increase bagging diversity
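
A minimal NumPy sketch of how one bootstrapped sample could be drawn (illustrative only, not the scikit-learn internals):

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))          # toy dataset: 100 rows, 4 features
y = rng.integers(0, 2, size=100)

# Draw row indices with replacement: some rows repeat, others are left out
rows = rng.integers(0, len(X), size=len(X))
# Optionally sub-sample the features as well, to increase diversity
cols = rng.choice(X.shape[1], size=2, replace=False)

X_boot, y_boot = X[np.ix_(rows, cols)], y[rows]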

Random Forests = Bagged Trees

Random Forests are a bagged ensemble of Decision Trees

Predictions are averaged in regression tasks, and decided by majority vote in classification tasks
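
In scikit-learn (hyperparameters below are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Classification: each tree votes, the majority class wins
forest_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest_clf.fit(X, y)
forest_clf.predict(X[:5])

# Regression works the same way, but the trees' predictions are averaged
forest_reg = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)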

Pros and Cons of Bagging

👍 Advantages:

  • Reduces variance (overfitting)

  • Can be applied to any model

👎 Disadvantages

  • Complex structure

  • Long training time

  • Disregards the performance of individual sub-models

3. Boosting

Boosting is a method designed to train weak learners that learn from their predecessor’s mistakes

  • It is a sequential ensemble method

  • The aim of boosting is to reduce bias

  • Focuses on the observations that are harder to predict

  • The best weak learners are given more weight in the final vote
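
A classic algorithm that follows this recipe is AdaBoost; a minimal scikit-learn sketch (dataset and hyperparameters illustrative), where each weak learner is a shallow tree by default:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Each new weak learner focuses on the observations its predecessors got wrong,
# and the better weak learners receive a larger weight in the final vote
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
ada.fit(X, y)
ada.predict(X[:5])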

3.1 Gradient Boosting 🔥

In practice, only implemented with Decision Trees as weak learners

Instead of updating the weights of observations that were misclassified, Gradient Boosting will

  1. Sequentially fit each weak learner \(d_\text{tree i}\) to predict the residuals of the previous one
  2. Add up the predictions of all weak learners

\[D(\mathbf{x}) = d_\text{tree 1}(\mathbf{x}) + d_\text{tree 2}(\mathbf{x}) + ...+ d_\text{tree n}(\mathbf{x})\]
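
This recipe can be written out by hand with scikit-learn trees (a sketch for squared-error loss, where the residuals are simply the target minus the current prediction):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

trees, residuals = [], y.astype(float)
for _ in range(3):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    residuals = residuals - tree.predict(X)  # what is still left to explain
    trees.append(tree)

# Final prediction D(x) = sum of all the weak learners' predictions
D = np.sum([tree.predict(X) for tree in trees], axis=0)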

Visual Explanation

XGBoost (Extreme Gradient Boosting)

  • Dedicated library, optimized for this task

  • Nice features inspired by Deep Learning

First, let’s split the dataset into train, test and validation sets

Splitting the Data

from sklearn.model_selection import train_test_split

# Split the data: 70% train, 30% temporary test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # TEST = 30%
)

# Split the temporary test set in half: 15% test, 15% validation
X_test, X_val, y_test, y_val = train_test_split(
    X_test, y_test, test_size=0.5, random_state=42  # TEST = 15%, VAL = 15%
)

Then, we plug the data into the XGBRegressor!

from xgboost import XGBRegressor

xgb_reg = XGBRegressor(max_depth=10, n_estimators=100, learning_rate=0.1)

xgb_reg.fit(X_train, y_train,
    # evaluate loss at each iteration
    eval_set=[(X_train, y_train), (X_val, y_val)],  
    # stop iterating when the validation loss has not improved for 5 rounds in a row
    early_stopping_rounds=5
)

y_pred = xgb_reg.predict(X_val)

How does it work?

  1. An initial prediction is made on the whole dataset with a single tree.

  2. Residuals are computed from the predicted and the observed values. A decision tree is fitted on these residuals.

  3. The tree's predicted residuals update the overall prediction, and new residuals are computed for the dataset; these are used to construct another tree.

  4. This process is repeated until the residuals stop shrinking or for a specified number of iterations. Each subsequent tree learns from the previous trees (the Boosting principle).

Optimization: Parameters and methods

XGBoost uses the following parameters and methods to optimize the algorithm and provide better results and performance:

Regularization: a regularization parameter (lambda) is used when calculating the similarity scores, to reduce the sensitivity to individual data points and avoid overfitting.

Pruning: a tree complexity parameter (gamma) is compared against the gains. Branches whose gain is smaller than gamma are removed, which prevents overfitting by trimming unnecessary branches and reducing the depth of the trees.

Weighted quantile sketch: instead of testing every possible value as a splitting threshold, only weighted quantiles are used as candidate splits.

Parallel learning: note that XGBoost doesn't train multiple trees in parallel; rather, it parallelizes WITHIN a single tree to create branches independently.
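
These knobs roughly correspond to XGBRegressor hyperparameters; the values below are illustrative:

from xgboost import XGBRegressor

xgb_reg = XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    reg_lambda=1.0,      # lambda: L2 regularization of the leaf weights
    gamma=0.1,           # minimum gain required to keep a split (pruning)
    tree_method='hist',  # approximate, binned split search instead of exact
    n_jobs=-1,           # parallelization happens within each tree
)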

Pros and Cons of Boosting

👍 Advantages:

  • Strong sub-models have more influence in the final decision

  • Reduces bias

👎 Disadvantages:

  • Computationally expensive (sequential)

  • Easily overfits

  • Sensitive to outliers (too much time spent trying to correctly predict them)

Ensemble Methods Summary

Ensemble learning combines several base algorithms, like Decision Trees
Ensemble methods can be broken down into two categories:

Parallel learners: models trained in parallel; predictions are aggregated

  • RandomForestRegressor
  • BaggingRegressor

Sequential learners: models trained sequentially so as to learn from predecessors’ mistakes

  • AdaBoostRegressor
  • GradientBoostingRegressor
  • XGBRegressor

4. Stacking

Training different estimators and aggregating their predictions

  • Different estimators (KNN, LogReg, etc.) capture different structures in the data

  • Combining sometimes enhances the predictive power

  • The results are aggregated either by voting (classification) or averaging (regression)

a) Simple aggregation

scikit-learn VotingClassifier and VotingRegressor

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

forest = RandomForestClassifier()
logreg = LogisticRegression()

ensemble = VotingClassifier(
    estimators=[("rf", forest), ("lr", logreg)],
    voting='soft',   # use predict_proba of each classifier before voting
    weights=[1, 1]   # weight forest and logreg equally in the vote
)
ensemble.fit(X_moon, y_moon)
plot_decision_regions(X_moon, y_moon, classifier=ensemble)

🤔 How do you choose the best weights?
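
One option is to treat the weights as hyperparameters and grid-search them; a sketch reusing the ensemble above (the candidate weight combinations are arbitrary):

from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    ensemble,  # the VotingClassifier defined above
    param_grid={'weights': [[1, 1], [2, 1], [1, 2]]},
    cv=5,
    scoring='accuracy',
)
search.fit(X_moon, y_moon)
search.best_params_  # best weight combination found by cross-validation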

b) Multi-layer Stacking!

Train a final estimator on the predictions of the previous ones

scikit-learn StackingClassifier and StackingRegressor

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

ensemble = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()),
                ("knn", KNeighborsClassifier(n_neighbors=10))],
    final_estimator=LogisticRegression())

ensemble.fit(X_moon, y_moon)
plot_decision_regions(X_moon, y_moon, classifier=ensemble)

Other algorithms

Quantile Random Forest

Most estimators return \(E(Y|X)\) at prediction time, which can be interpreted as the answer to the question: what is the expected value of the output given the input?

When a classical decision tree is fitted, only sufficient statistics of the target, such as the mean and variance, are stored at each leaf node.

A Quantile Random Forest instead keeps all the target values in each leaf; at prediction time, these are used to compute empirical quantile estimates.
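
The idea sketched for a single tree with plain scikit-learn (a Quantile Random Forest does this across all the trees of the forest):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

tree = DecisionTreeRegressor(min_samples_leaf=20).fit(X, y)

# Keep every training target per leaf instead of only the leaf mean
leaf_of_train = tree.apply(X)
targets_per_leaf = {leaf: y[leaf_of_train == leaf] for leaf in np.unique(leaf_of_train)}

# At prediction time, return an empirical quantile of the reached leaf's targets
leaf_of_new = tree.apply(X[:5])
q90 = [np.quantile(targets_per_leaf[leaf], 0.9) for leaf in leaf_of_new]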

Linear Trees

Linear Trees combine the learning ability of Decision Trees with the predictive and explanatory power of Linear Models.

  • As in tree-based algorithms, the data are split according to simple decision rules.
  • Linear Models are fitted in the nodes: the models in the leaves are linear instead of the constant approximations used in classical Decision Trees.
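
A third-party implementation is available; a minimal sketch assuming the linear-tree package and its LinearTreeRegressor API:

from sklearn.linear_model import LinearRegression
from lineartree import LinearTreeRegressor  # third-party package: linear-tree

lt_reg = LinearTreeRegressor(
    base_estimator=LinearRegression(),  # linear model fitted in each leaf
    max_depth=5,
)
# lt_reg.fit(X_train, y_train) -> splits like a tree, predicts with a linear model per leaf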

Linear Forests & Linear Boosting

  • First, a Linear Model is fitted on the whole dataset,
  • Then a Random Forest or Boosting algorithm is trained on the same dataset, using the residuals of the previous step as the target.

The final predictions are the sum of the raw linear predictions and the residuals modeled by the Random Forest or Boosting algorithm.
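
The recipe written out by hand with scikit-learn (a sketch using a Random Forest on an illustrative synthetic dataset):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

# 1. Fit a Linear Model on the raw target
linear = LinearRegression().fit(X, y)

# 2. Train a Random Forest on the residuals of the linear model
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X, y - linear.predict(X))

# 3. Final prediction = raw linear prediction + modeled residuals
y_pred = linear.predict(X) + forest.predict(X)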

LightGBM

LightGBM splits trees leaf-wise, always expanding the leaf with the best fit, whereas other boosting algorithms split trees depth-wise (level by level). In other words, LightGBM grows trees vertically while other algorithms grow them horizontally.

Advantages of LightGBM

  • Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training.

  • Lower memory usage: replacing continuous values with discrete bins results in lower memory usage.

  • Almost the same results as XGBoost and compatibility with large datasets: it performs equally well on large datasets, with a significant reduction in training time compared to XGBoost.

Disadvantages of LightGBM

  • Overfitting: LightGBM splits trees leaf-wise, which can lead to overfitting as it produces more complex trees.
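
A minimal sketch with the lightgbm package (hyperparameters are illustrative; num_leaves caps the leaf-wise growth and is the main lever against overfitting):

from lightgbm import LGBMRegressor

lgbm_reg = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,   # leaf-wise growth: limit the number of leaves rather than the depth
    max_depth=-1,    # -1 means no depth limit (the default)
)
# lgbm_reg.fit(X_train, y_train)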

NGBoost

NGBoost enables predictive uncertainty estimation with Gradient Boosting through probabilistic predictions.

Probabilistic prediction is the approach where the model outputs a full probability distribution over the entire outcome space, instead of a single point estimate.

How?

By nonparametrically modeling the conditional parameters of the multivariate predictive distribution.


For example, \(Y|X\) may be assumed to follow a univariate Normal distribution where \(\mu\) and \(\sigma\) are taken to be the parameter vector \(\theta\).

The output of NGBoost is \(\theta(X) = (\mu(X), \sigma(X))\).
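
A minimal sketch assuming the ngboost package's NGBRegressor API with a Normal distribution:

from ngboost import NGBRegressor
from ngboost.distns import Normal

ngb = NGBRegressor(Dist=Normal, n_estimators=500)
# ngb.fit(X_train, y_train)
# dist = ngb.pred_dist(X_val)  # full predictive distribution per sample
# dist.params                  # e.g. the fitted mu(X) and sigma(X)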